Analyzing big data frames becomes more and more important for working on biological issues. This is the reason why we want to have a look at such an exploratory analysis. Therefore, we focus on data that report different cellular responses to drug perturbations in cancer treatment. First of all, we will perform a broad exploratory analysis over the whole data set. Then we will continue looking at the specific anticancer drug vorinostat.
First of all, we want to explore the whole data from all 15 drug responses including the treated and untreated data sets. Therefore we perform the following analysis steps:
3. Comparison of treated and untreated gene expression
4. Analysis of Fold Change matrix
5. Principal Component Analysis
Before we are able to start the broad analysis, we load the needed data. The data sets are taken from the NCI Transcriptional Pharmacodynamics Workbench, including the effect on 13.299 genes from 61 cell lines treated with 15 different anticancer agents.
library(readr)
Untreated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_untreated.rds"))
Treated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_treated.rds"))
Metadata = read.table(paste0(wd,"/data/NCI_TPW_metadata.tsv"),
header = TRUE, sep ="\t", stringsAsFactors = TRUE)
Treated <- as.data.frame(Treated)
Untreated <- as.data.frame(Untreated)
Every measurement is influenced by errors, which can arise from different environmental conditions or different softwares which were used for analyzation. Anti cancer drugs are supposed to affect only the expression of few specific target genes instead of increasing or decreasing the expression of the whole genome. Consequently, a comparison of boxplots which show the overall gene expression of each samples should not show distinct differences even if celllines were treated with different drugs.
First of all, we performed a boxplot of the treated data without coloring:
boxplot(Treated,
xlab="sampels", ylab="gene expression",
main= "Gene expression treated samples",
names= FALSE, xaxt= "n")
We can identify different “boxes” in this plot, indicating that we have batches in our data. We assume that the each box belongs to one drug. In order to check this, we now colour the boxes according to the drugs.
# levels for coloring
drug <- Metadata$drug
# 15 diffrent colors, for each drug one
palette(rainbow(15))
# Boxplot
par(mar=c(5, 4, 5, 9))
boxplot(Treated, medcol="black", border = drug, col= drug,
xlab="sampels", ylab="gene expression",
main= "Gene expression treated samples colored by drug",
names= FALSE, xaxt= "n", boxwex=1, boxlty =0)
#add a legend to see which color corresponds to which drug:
levels <- as.factor(levels(drug))
legend("topright", inset = c(-0.4,0), legend= levels(drug), xpd = TRUE, pch=19,
col = levels, title = "drugs")
The different colors indicate the 15 different anticancer drugs. We can can clearly identify 15 different boxes, each box belonging to one medicine. This indicates that we have batches between all 15 drugs. It might be, that the drug treatments were performed on different days under slightly different conditions like air pressure or room temperature.
If we normalize the data, we can remove the batch effects. Our plot will change in the following way:
# normalize the data
Untreated_norm <- apply(Untreated, 2, function(x){
(x - mean(x)) / sd(x)
})
Treated_norm <- apply(Treated, 2, function(x){
(x - mean(x)) / sd(x)
})
# repeat creation of boxplot
# levels for coloring
drug <- Metadata$drug
palette(rainbow(15))
# Boxplot
par(mar=c(5, 4, 5, 9))
boxplot(Treated_norm, medcol="black", border = drug, col= drug,
xlab="sampels", ylab="gene expression",
main= "Gene expression treated celllines with normalized data",
names= FALSE, xaxt= "n", boxwex=1, boxlty =0)
#add a legend to see which color corresponds to which drug:
levels <- as.factor(levels(drug))
legend("topright", inset = c(-0.4,0), legend= levels(drug), xpd = TRUE,
pch=19, col = levels, title = "drugs")
We see that after normalization of the data the batches are almost unrecognizable.
Due to the gene-specific impact of anti-cancer drugs, we expect similar gene expressions of treated and untreated samples as long as we do not focus on target genes.
Untreated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_untreated.rds"))
Treated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_treated.rds"))
par(mar=c(5, 4, 5, 9))
plot(density(Untreated),col="blue" ,xlab = "Gene expression",
main = "Comparison of treated and untreated gene expression ")
lines(density(Treated), col = "red")
legend("topright", inset = c(-0.4,0), legend=c("Untreated", "Treated") , xpd = TRUE,
pch=19, col = c("blue", "red"))
As expected, the density plot demonstrates that the distribution of expression values of treated and untreated samples is almost equal.
We aim to find genes which vary the most in Untreated and Treated celllines. This might be useful to detect outlier genes which were incorrectly measured for some samples. If genes show a highly variable expression in both untreated and treated samples we can better trust their expression values since it is rather unlikely that measurement errors occured twice at the same gene and that drugs that influence only few selected genes affect especially those highly variable genes.
We use the apply-function to calculate the variance rowwise for each gene, afterwards we sort the variances by decreasing value and show the ten highest variable genes for the treated as well as for the untreated data.
VarianceTreated <- apply(Treated_norm, 1, var)
VarianceTreatedSorted <- sort(VarianceTreated, decreasing = TRUE)
VarianceTreatedSorted[1:10]
## KRT19 SPARC SPP1 UCHL1 EPCAM TFPI2 BASP1 GJA1
## 2.138284 2.124631 2.111081 1.968559 1.867512 1.740092 1.730896 1.725725
## TGFBI CD24
## 1.719961 1.715314
VarianceUntreated <- apply(Untreated_norm, 1, var)
VarianceUntreatedSorted <- sort(VarianceUntreated, decreasing = TRUE)
VarianceUntreatedSorted[1:10]
## SPARC KRT19 SPP1 UCHL1 EPCAM CD24 TGFBI GJA1
## 2.228530 2.225019 2.154206 1.937034 1.914216 1.903747 1.895406 1.856253
## BASP1 EFEMP1
## 1.844048 1.727522
For visualization of those highly variable genes we use the package ggplot2. To plot the genes with their variance-values in a two-dimensional coordinate system we assign a number to each gene in order to define the positions of the genes in direction of the x-axis.
x <- c(1:13299)
VarianceTreated <- as.data.frame(VarianceTreated)
VarianceTreatedwithnumbers <- cbind(VarianceTreated, x)
VarianceTreatedwithnumbers <- as.data.frame(VarianceTreatedwithnumbers)
VarianceUntreated <- as.data.frame(VarianceUntreated)
VarianceUntreatedwithnumbers <- cbind(VarianceUntreated, x)
VarianceUntreatedwithnumbers <- as.data.frame(VarianceUntreatedwithnumbers)
Next, we define subgroups of the data set. Only those genes, which have a variance equal to or greater than the tenth largest variance-value, are supposed to be labeled in the plot. Moreover, we use the intersect-function to define a subset which includes only those genes present in both, treated and untreated top ten highly variable genes.
#For treated data:
subsettreated <- subset(VarianceTreatedwithnumbers,
VarianceTreatedwithnumbers$VarianceTreated >= VarianceTreatedSorted[10])
overlap <- intersect(names(VarianceTreatedSorted[1:10]), names(VarianceUntreatedSorted[1:10]))
subsetboth_treated <- VarianceTreatedwithnumbers[overlap,]
#For untreated data:
subsetuntreated <- subset(VarianceUntreatedwithnumbers,
VarianceUntreatedwithnumbers$VarianceUntreated >= VarianceUntreatedSorted[10])
subsetboth_untreated <- VarianceUntreatedwithnumbers[overlap,]
The yellow point represent those genes which are part of the top ten highly variable genes of both untreated and treated data. Only one gene is not present in both top ten highly variable genes, it is highlighted as a red point in the following plots.
library(ggplot2)
#Plot for untreated data:
plotuntreated <- ggplot(VarianceUntreatedwithnumbers, aes (x= x, y= VarianceUntreated ))+
geom_point()+
ggtitle("Genes with highest variance in untreated data")+
geom_text(data= subsetuntreated, label= rownames(subsetuntreated),
hjust=1, vjust=1, size =3)+
geom_point(data = subsetuntreated, colour="red", size = 3)+
geom_point(data= subsetboth_untreated, size = 3, color = "yellow")
#Plot for treated data:
plottreated <- ggplot(VarianceTreatedwithnumbers, aes (x= x, y= VarianceTreated ))+
geom_point()+
ggtitle("Genes with highest variance in treated data")+
geom_text(data=subsettreated, label= rownames(subsettreated),
hjust=1, vjust=1, size =3)+
geom_point(data = subsettreated, colour="red", size = 3)+
geom_point(data= subsetboth_treated, size = 3, color = "yellow")
library(gridExtra)
grid.arrange(plotuntreated, plottreated, nrow = 1)
We can conclude, that most genes which show a very dissimilar expression in different untreated samples behave in the same way in treated samples. Thus, those genes were probably not incorrectly measured but are probably very cell-specific and for sure no housekeeping genes. Moreover, drug treatment did not change the high variation for the matching genes, they might be not clearly affected by the drugs.
The fold change matrix (fold change = FC) describes the changes in gene expression between the treated and the untreated cell lines. First of all we visualize the FC with a bar chart which is colored by different drugs.
# create FC data
FC_all = (Treated - Untreated)
FC_all_mean = colMeans(FC_all)
# create levels for coloring
drug <- Metadata$drug
palette(rainbow(15))
# create boxplot
par(mar=c(5, 4, 5, 9))
barplot( height = FC_all_mean, names= FALSE, col = drug, border = NA,
main= "Fold changes by treatment with 15 anticancer drugs",
xlab="samples", ylab="mean Fold Change values")
# create a legend
levels <- as.factor(levels(drug))
legend("topright", inset = c(-0.4,0.0), legend= levels(drug), xpd = TRUE,
pch=19, col = levels, title = "drugs")
Here we can see that most FC values are in a similar value range.This is beacuse we used the column mean, so we get one value for each sample. Only 5-Azacytidine and bortezomib have clear outliers. Moreover brotezomib and geldanamycin show mostly downregulated genes.
Coloring the scatter plot according to the tissue type shows that there is no correlation between FC values and tissue type, if we look at all drugs:
# scatter plot
par(mar=c(5, 4, 5, 9))
plot(FC_all_mean, col= Metadata$tissue, main="Gene expression change colored by tissues",
xlab="samples",ylab="mean Fold Change values")
# legend
tissue <- Metadata$tissue
levels <- as.factor(levels(tissue))
legend("topright", inset = c(-0.4,0), legend= levels(tissue), xpd = TRUE, pch=19, col = levels,
title = "tissues")
We can identify the Top 10 values from the FC matrix, which indicate the ten most up and down regulated samples by a specific drug in a specific cell line.
### find all min and max
FC_all_min= (apply(FC_all,2,min))
FC_all_max= (apply(FC_all,2,max))
### sort min and max
# most down regulated genes
largest10_FC_all_min <- (sort(FC_all_min, decreasing = F)[1:10])
largest10_FC_all_min =as.data.frame(largest10_FC_all_min)
knitr::kable(largest10_FC_all_min, caption = "10 lowest FC values")
| largest10_FC_all_min | |
|---|---|
| T-47D_vorinostat_5000nM_24h | -9.062011 |
| OVCAR-4_sorafenib_10000nM_24h | -7.601793 |
| NCI-H522_geldanamycin_1000nM_24h | -7.280544 |
| MALME-3M_vorinostat_5000nM_24h | -6.864732 |
| OVCAR-4_doxorubicin_1000nM_24h | -6.843617 |
| UACC-257_vorinostat_5000nM_24h | -6.813984 |
| OVCAR-4_bortezomib_100nM_24h | -6.608203 |
| NCI-H226_bortezomib_100nM_24h | -6.447166 |
| COLO-205_bortezomib_100nM_24h | -6.391190 |
| OVCAR-3_sirolimus_100nM_24h | -6.353995 |
# most up regulated genes
largest10_FC_all_max <- (sort(FC_all_max, decreasing = T)[1:10])
largest10_FC_all_max =as.data.frame(largest10_FC_all_max)
knitr::kable(largest10_FC_all_max, caption = "10 highest FC values")
| largest10_FC_all_max | |
|---|---|
| T-47D_bortezomib_100nM_24h | 9.493492 |
| T-47D_vorinostat_5000nM_24h | 9.162674 |
| SNB-75_bortezomib_100nM_24h | 9.078432 |
| SK-MEL-28_vorinostat_5000nM_24h | 8.906626 |
| NCI-H226_doxorubicin_1000nM_24h | 8.833774 |
| SF-268_bortezomib_100nM_24h | 8.823692 |
| OVCAR-3_bortezomib_100nM_24h | 8.681822 |
| HS-578T_bortezomib_100nM_24h | 8.604213 |
| COLO-205_geldanamycin_1000nM_24h | 8.578627 |
| 786-0_bortezomib_100nM_24h | 8.529357 |
Here we can see, that vorinostat (occurs 3 times for downregulated and 2 times for up regulated), bortezomib (occurs 3 times for downregulated and 6 times for up regulated) seem to have great effects on the gene expression in the celllines. Moreover, the OVCAR-4 cell line, which bellows to the ovarian cancer, occurs the most (3 times) in the down regulated data.
Now we want to see which genes are changed the most. Therefore we are looking for the highest and lowest FC values of our genes.
FC=Treated-Untreated
### most up regulated
samplemax=apply(FC,1, max)
max10=(sort(samplemax, decreasing = T)[1:10])
max10=as.data.frame(max10)
knitr::kable(max10, caption = "most up regulated ")
| max10 | |
|---|---|
| KRT6A | 9.493492 |
| SEPP1 | 9.162674 |
| HSPA6 | 9.078432 |
| ARL14 | 8.906626 |
| CSTA | 8.833774 |
| DHRS2 | 8.578627 |
| CGA | 8.358072 |
| STC1 | 8.336499 |
| FABP4 | 8.250516 |
| CDH1 | 8.248601 |
### most down regulated
samplemax=apply(FC,1, min)
min10=(sort(samplemax, decreasing = F)[1:10])
min10=as.data.frame(min10)
knitr::kable(min10, caption = "most down regulated ")
| min10 | |
|---|---|
| POSTN | -9.062011 |
| SPARC | -8.336483 |
| COL3A1 | -8.140189 |
| TFPI | -7.974582 |
| AXL | -7.892618 |
| MSN | -7.738135 |
| TMEM100 | -7.601793 |
| CAV1 | -7.549387 |
| MAFF | -7.514932 |
| COL4A2 | -7.509959 |
The PCA analysis was performed for two matrices:
For both PCAs, the celllines were colored by 2 different features:
Execute the PCA for Treated data:
treated.pca = prcomp(Treated_norm)
Hereinafter, we want to use information from Metadata to color different celllines in the PCA. Therefore we need to check, if the celllines in the sample-column of Metadata are in the same order as in the Treated matrix. First of all we test, if the number of samples is equal.
Load Metadata:
library(readr)
Metadata = read_tsv(paste0(wd,"/data/NCI_TPW_metadata.tsv"))
identical(nrow(Metadata), ncol(Treated_norm))
## [1] FALSE
nrow(Metadata)
## [1] 1638
ncol(Treated_norm)
## [1] 819
Metadata consists of twice as much celllines as the Treated matrix since Metadata contains information for treated and untreated celllines. We want to print those rows from Metadata which do not contain a zero concentration because they belong to the treated samples.
TreatedrowsMetadata <- grep(Metadata$sample, pattern = "_0nM_", invert = TRUE)
Check, if the sample order is equal in the Treated-matrix and in Metadata:
Metadata <- as.data.frame(Metadata)
Metadatasamples <- Metadata[TreatedrowsMetadata,"sample"]
all(colnames(Treated_norm)== Metadatasamples)
## [1] TRUE
Consequently the drug information of the Metadata-matrix can be assigned to the samples in the Treated-matrix sequentially. For better readability, we assign the column of interest to the name “Metadatadrugs”:
Metadatadrugs <- Metadata[TreatedrowsMetadata,"drug"]
Add Metadatadrugs as a new row to the Treated-matrix:
Treatedwithdrugs <- rbind(Treated_norm, Metadatadrugs)
Save drug information as factors so it can be used for coloring:
drugfactor <- as.factor(Treatedwithdrugs["Metadatadrugs",])
Now we can go on with coloring!
Since we have 15 different drugs we need 15 different colors:
palette(rainbow(15))
Plot Principal component 1 and 2 and add a legend to the plot. To see the PCA plot and the legend next to each other, the graphical parameters are setted by the par() function.
par(mar=c(5, 4, 5, 9))
plot(treated.pca$rotation[, 1], treated.pca$rotation[, 2], pch = 19, xlab = "PC1",
ylab = "PC2", col= drugfactor, main = "PCA Treated colored by drug")
druglevels <- as.factor(levels(drugfactor))
legend("topright", inset = c(-0.4,0),levels(drugfactor), xpd = TRUE, pch=19,
col = druglevels, title = "Drug")
We do not see, that samples treated with the same drug form groups in the plot. However, we did not expect that, since we only look at the final expression and not at the expression change.
The information which is needed for coloring is summarized as Metadatatissue:
Metadatatissue <- Metadata[TreatedrowsMetadata,"tissue"]
Bind Metadatatissue as a new row to the Treated matrix:
Treatedwithtissue <- rbind(Treated_norm, Metadatatissue)
Save tissue information as factors so it can be used for coloring:
tissuefactor <- as.factor(Treatedwithtissue["Metadatatissue",])
Since we have 9 different tissue types we need 9 different colors:
palette(rainbow(9))
Plot PC 1 and PC 2 and add a legend:
par(mar=c(5, 4, 5, 9))
plot(treated.pca$rotation[, 1], treated.pca$rotation[, 2], pch = 19, xlab = "PC1",
ylab = "PC2", col= tissuefactor, main= "PCA Treated colored by tissue")
levels <- as.factor(levels(tissuefactor))
tissuelevels <- as.factor(levels(tissuefactor))
legend("topright", inset = c(-0.3,0), levels(tissuefactor), xpd = TRUE, pch=19,
col = tissuelevels, title = "Tissue")
PC 1 and PC 2 group the treated celllines as well as other PC combinations. Thus, most of the celllines of the same tissue-type seem to have similarities regarding their gene expression.
We execute the PCA with normalized FC data:
FC_norm <- Treated_norm - Untreated_norm
pca.FC = prcomp(FC_norm)
We want to see how much variance is explained by each principle component:
plot(pca.FC, type = "l", main = "Variances explained by the Principal Components")
We can interpret, that PC 1 explains clearly the greatest part of the variance whereas a much lower proportion is explained by PC2. However, PC2 to PC6 approximately form a line until a weak “elbow” can be seen after the sixth PC. Nevertheless, we should not exclude PCs explaining less variance from our further analysis.
Bind the tissue-information as a new row to the FC matrix:
FCwithtissue <- rbind(FC_norm, Metadatatissue)
Save tissue information as factors so it can be used for coloring:
tissuefactorFC <- as.factor(FCwithtissue["Metadatatissue",])
We plotted different PCs to see which combination groups the samples best. However, different tissues do not seem to group the points in any PC combination.
Example: PC1 and PC2 do not group celllines of same tissue-type:
palette(rainbow(9))
par(mar=c(5, 4, 5, 9))
plot(pca.FC$rotation[, 1], pca.FC$rotation[, 2], col = tissuefactorFC, pch = 19, xlab = "PC1",
ylab = "PC2", main = "PCA FC colored by tissue")
levels <- as.factor(levels(tissuefactorFC))
legend("topright", inset = c(-0.3,0), levels(tissuefactorFC), xpd = TRUE, pch=19,
col = tissuelevels, title = "Tissue")
Since we are not able to identify groups of celllines of the same tissue, fold changes might be not very tissue-specific.
Create a new matrix (“FCwithdrugs”) where the druginformation is added as a new row to the FC-matrix:
FCwithdrugs <- rbind(FC_norm, Metadatadrugs)
Save drug information as factors so it can be used for coloring:
drugfactorFC <- as.factor(FCwithdrugs["Metadatadrugs",])
Plot PC 1 & PC 2:
palette(rainbow(15))
par(mar=c(5, 4, 5, 9))
plot(pca.FC$rotation[, 1], pca.FC$rotation[, 2], col = drugfactorFC , pch = 19, xlab = "PC1"
, ylab = "PC2", main = "PCA FC colored by drugs")
levels <- as.factor(levels(drugfactorFC))
legend("topright", inset = c(-0.4,0), levels(drugfactorFC), xpd = TRUE, pch=19,
col = druglevels, title = "Drugs")
Plot PC 2 and PC 3:
par(mar=c(5,4,5,9))
plot(pca.FC$rotation[, 2], pca.FC$rotation[, 3], col = drugfactorFC , pch = 19, xlab = "PC2",
ylab = "PC3", main = "PCA FC colored by drug")
levels <- as.factor(levels(drugfactorFC))
legend("topright", inset = c(-0.4,0), levels(drugfactorFC), xpd = TRUE, pch=19,
col = druglevels, title = "Drugs")
Many combinations of Principal Components clearly group celllines treated with the same drug. Consequently, the FC of celllines seems to be drug-specific.
Since we are going to analyze the effects of Vorinostat in our specific analysis we want to plot a PCA that highlights exclusively those celllines which belong to Vorinostattreatment.
Therefore we use the ifelse-function:
Metadata <-as.data.frame(Metadata)
Marking <- ifelse(Metadata$drug == "vorinostat", "yellow", "black")
Add the information, whether samples belong to Vorinostat, to the FC matrix:
HighlightVorinostat <- cbind(`FC_norm` = Marking)
Plot PC 1 and PC 2:
par(mar=c(5, 4, 5, 9))
plot(pca.FC$rotation[, 1], pca.FC$rotation[, 2], col = HighlightVorinostat, pch = 19,
xlab = "PC1", ylab = "PC2", main = "PCA FC Highlighted Vorinostat samples")
legend("topright", inset = c(-0.3,0), legend = c("Vorinostat","Other drugs"),
xpd = TRUE, pch=19, col = c("yellow", "black"))